In the previous exercise, you built a simple linear model to predict housing prices. This exercise is very similar, but we will now incorporate training, validation and testing.
The goal of this exercise is to give you simple, hands-on experience with the process of training, validating and testing a model. The exercise leads you through the entire process step by step, from partitioning your data all the way to testing your final model. This should reinforce what you have learned about the role each set plays in the supervised learning process. It will also give you a basic idea of problems such as model selection and feature selection, which are common and crucial parts of supervised learning. If you are unclear about the purpose of each dataset, go back to the Cross-Validation and Overfitting slides and review slide 4 before starting this exercise.
For data, we will use a slightly modified version of the Boston Housing Dataset. Good luck!
In [68]:
# import libraries
import matplotlib
import IPython
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import pylab
import seaborn as sns
import sklearn as sk
%matplotlib inline
housing = """Read the .csv file containing the housing data"""
Before we can begin training our model and testing it, it is important to first properly partition the data into three subsets: Training, Validation and Test Set. We will be using the Holdout method for the purposes of Cross-Validation. Make sure that there is NO data overlap between these datasets. Also, remember that the Test set is only used once we are fully satisfied that our model is properly trained.
In [69]:
housing_training_set = """Training set goes here"""
housing_validation_set = """Validation set goes here"""
housing_test_set = """Testing set goes here"""
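One possible way to carve out three non-overlapping subsets with the holdout method is sketched below. It is only an illustration: the data is synthetic, and the column names (`tax`, `pratio`, `medv`), the `_demo` variable names and the roughly 60/20/20 proportions are all assumptions, not part of the exercise data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data; column names are assumptions.
rng = np.random.default_rng(0)
housing_demo = pd.DataFrame({
    "tax": rng.uniform(180, 720, size=100),
    "pratio": rng.uniform(12, 22, size=100),
})
housing_demo["medv"] = (
    40 - 0.02 * housing_demo["tax"] - 0.5 * housing_demo["pratio"]
    + rng.normal(0, 2, size=100)
)

# Step 1: hold out 20% of the rows as the test set.
train_val_demo, test_demo = train_test_split(
    housing_demo, test_size=0.2, random_state=0
)
# Step 2: split the remainder into training and validation sets
# (25% of the remaining 80% is about 20% of the full data).
train_demo, val_demo = train_test_split(
    train_val_demo, test_size=0.25, random_state=0
)

# Check that no row appears in more than one subset.
assert set(train_demo.index).isdisjoint(val_demo.index)
assert set(train_demo.index).isdisjoint(test_demo.index)
assert set(val_demo.index).isdisjoint(test_demo.index)

print(len(train_demo), len(val_demo), len(test_demo))
```

Fixing `random_state` makes the partition reproducible, so the test set stays the same every time the notebook is re-run.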
In [70]:
# Define your two predictors and response here
X_1 = """First Model Predictor"""
X_2 = """Second Model Predictor"""
Y = """Model Response"""
In [71]:
from sklearn.linear_model import LinearRegression
# Define your two models with different features (e.g. 'tax', 'pratio') here. Feel free to change the names of the models.
lin_mod_param1 = LinearRegression()
lin_mod_param2 = LinearRegression()
# Train both models on the training data
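As a rough sketch of what training the two single-feature models could look like, the snippet below fits each `LinearRegression` on a synthetic training set. The data, the response column `medv` and the names `model_tax` and `model_pratio` are hypothetical; substitute your own training set and predictors.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic training data; column names are assumptions.
rng = np.random.default_rng(1)
train_demo = pd.DataFrame({
    "tax": rng.uniform(180, 720, size=60),
    "pratio": rng.uniform(12, 22, size=60),
})
train_demo["medv"] = (
    40 - 0.02 * train_demo["tax"] - 0.5 * train_demo["pratio"]
    + rng.normal(0, 2, size=60)
)

# Double brackets keep each predictor as a 2-D DataFrame,
# which is the shape scikit-learn expects for X.
model_tax = LinearRegression().fit(train_demo[["tax"]], train_demo["medv"])
model_pratio = LinearRegression().fit(train_demo[["pratio"]], train_demo["medv"])

print("tax model:   ", model_tax.coef_, model_tax.intercept_)
print("pratio model:", model_pratio.coef_, model_pratio.intercept_)
```

Note that both models are fitted on the training set only; the validation and test sets are not touched at this stage.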
Now, given the two trained models, we want to determine which model is more accurate at making predictions on unseen data. To do this, we will 'test' both models on the validation set created earlier and determine which one performs better on this set. Remember that we are still in the training phase! The better-performing model will be used during the testing phase.
In [ ]:
# Use the validation set to evaluate the performance of both models
# Hint: you can use a method provided by sklearn.linear_model.LinearRegression for this purpose
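One way to compare the two models on the validation set is the `score` method of `LinearRegression`, which returns the coefficient of determination R² (higher is better). The sketch below assumes synthetic data and hypothetical column and variable names; it is not the only valid comparison metric.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)

def make_demo(n):
    """Synthetic stand-in for the housing data (hypothetical columns)."""
    df = pd.DataFrame({
        "tax": rng.uniform(180, 720, size=n),
        "pratio": rng.uniform(12, 22, size=n),
    })
    df["medv"] = (
        40 - 0.02 * df["tax"] - 0.5 * df["pratio"] + rng.normal(0, 2, size=n)
    )
    return df

train_demo, val_demo = make_demo(60), make_demo(20)

# Fit each candidate model on the training set only.
model_tax = LinearRegression().fit(train_demo[["tax"]], train_demo["medv"])
model_pratio = LinearRegression().fit(train_demo[["pratio"]], train_demo["medv"])

# Evaluate each model on the validation set; score() returns R^2.
r2_tax = model_tax.score(val_demo[["tax"]], val_demo["medv"])
r2_pratio = model_pratio.score(val_demo[["pratio"]], val_demo["medv"])

# Keep the feature whose model generalises better to the validation data.
best_feature = "tax" if r2_tax >= r2_pratio else "pratio"
print(f"R^2 (tax): {r2_tax:.3f}, R^2 (pratio): {r2_pratio:.3f}")
print("Selected feature:", best_feature)
```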
Once you have chosen one of the models from above, train it on the training set combined with the validation set to complete the training phase. Pandas has a useful method for concatenating two datasets that can help you here.
In [74]:
# Concatenate the training and validation set
# Train your model on the combined dataset
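A minimal sketch of this step, assuming `pd.concat` is the pandas function meant here and that the `tax` model was the one selected. The data is synthetic and all names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)

def make_demo(n):
    """Synthetic stand-in for the housing data (hypothetical columns)."""
    df = pd.DataFrame({"tax": rng.uniform(180, 720, size=n)})
    df["medv"] = 40 - 0.02 * df["tax"] + rng.normal(0, 2, size=n)
    return df

train_demo, val_demo = make_demo(60), make_demo(20)

# Stack the two subsets row-wise; ignore_index=True rebuilds a clean 0..n-1 index.
train_val_demo = pd.concat([train_demo, val_demo], ignore_index=True)

# Refit the selected model on the combined data.
final_model = LinearRegression().fit(train_val_demo[["tax"]], train_val_demo["medv"])
print(len(train_val_demo), final_model.coef_, final_model.intercept_)
```

Refitting on the combined data lets the final model benefit from the validation rows, which are no longer needed for model selection.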
At this point, you should have selected one of the models from above as the model that you will use to predict median house values, and trained it on the training set combined with the validation set. We are now in the testing phase! Test the selected model to see how well it performs on unseen data, using the test set that you created at the beginning of the exercise.
In [72]:
# Test your model here
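The final evaluation might look like the sketch below, which reports mean squared error and R² on the held-out test set. The data is synthetic, the names are hypothetical, and the choice of these two metrics is an assumption; the key point is that the test set is used exactly once, after all modelling decisions are made.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(4)

def make_demo(n):
    """Synthetic stand-in for the housing data (hypothetical columns)."""
    df = pd.DataFrame({"tax": rng.uniform(180, 720, size=n)})
    df["medv"] = 40 - 0.02 * df["tax"] + rng.normal(0, 2, size=n)
    return df

train_val_demo, test_demo = make_demo(80), make_demo(20)

# Model already selected and trained on training + validation data.
final_model = LinearRegression().fit(train_val_demo[["tax"]], train_val_demo["medv"])

# Predict on the test set, which the model has never seen.
test_pred = final_model.predict(test_demo[["tax"]])

test_mse = mean_squared_error(test_demo["medv"], test_pred)
test_r2 = r2_score(test_demo["medv"], test_pred)
print(f"Test MSE: {test_mse:.3f}, Test R^2: {test_r2:.3f}")
```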